Fast Newton-CG Method for Batch Learning of Conditional Random Fields
Abstract
We propose a fast batch learning method for linear-chain Conditional Random Fields (CRFs) based on Newton-CG methods. Newton-CG methods are a variant of Newton's method for high-dimensional problems: they require only Hessian-vector products instead of the full Hessian matrices. To speed up Newton-CG methods for CRF learning, we derive a novel dynamic programming procedure for the Hessian-vector products of the CRF objective function. The proposed procedure can reuse the byproducts of the time-consuming gradient computation for the Hessian-vector products, drastically reducing the total computation time of the Newton-CG methods. In experiments with tasks in natural language processing, the proposed method outperforms a conventional quasi-Newton method. Remarkably, the proposed method is competitive with online learning algorithms, which are fast but unstable.

Introduction

Linear-chain Conditional Random Fields (CRFs) model the conditional probability of output sequences (Lafferty, McCallum, and Pereira 2001). They are simple but have been applied to a variety of sequential labeling problems, including natural language processing (Sha and Pereira 2003) and bioinformatics (Chen, Chen, and Brent 2008). The learning task of CRFs can be regarded as the unconstrained minimization of a regularized negative log-likelihood function. Since training CRF models can be computationally intensive, we want an optimization method that converges rapidly.

In unconstrained optimization, we minimize an objective function $f : \mathbb{R}^d \to \mathbb{R}$ that depends on the $d$-dimensional parameter vector $\theta \in \mathbb{R}^d$, without constraints on the values of $\theta$. Optimization algorithms generate a sequence of iterates $\{\theta_k\}_{k=0}^{\infty}$. In each iteration, typical algorithms move from the current point $\theta_k$ to a new point $\theta_{k+1}$ along a search direction $s_k$ such that $f_{k+1} < f_k$, where $f_k \equiv f(\theta_k)$. The gradient descent direction $-g_k \equiv -\nabla_{\theta} f_k$ is the most obvious choice for the search direction. Another choice is the Newton step $-H_k^{-1} g_k$, where $H_k \equiv \nabla_{\theta}^2 f_k \in \mathbb{R}^{d \times d}$ is the Hessian matrix of $f$. The Newton step is the minimizer of the second-order Taylor approximation of $f$: $f(\theta_k + s) \approx f_k + g_k^{\top} s + \frac{1}{2} s^{\top} H_k s$. Optimization methods using second-order information have a fast rate of local convergence. Since the explicit computation of the $d \times d$ Hessian matrix is not practical for large $d$, Newton-CG methods were developed for large problems (Nocedal and Wright 2006). The Newton step is the solution of the Newton equation $H_k s = -g_k$, and Newton-CG methods use a Conjugate Gradient (CG) method to solve it. Since the CG method requires only Hessian-vector products of the form $H_k r \in \mathbb{R}^d$ for an arbitrary vector $r \in \mathbb{R}^d$, Newton-CG methods are far less resource-demanding. In addition, Newton-CG methods efficiently find a sufficiently good search direction by adaptively controlling the quality of the Newton step.
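To make the mechanics concrete, the following is a minimal sketch of one truncated Newton-CG step in Python (using jax.numpy). It approximately solves the Newton equation $H_k s = -g_k$ by conjugate gradients, touching the Hessian only through a caller-supplied Hessian-vector product callable; the name `hvp`, the tolerance, and the negative-curvature safeguard are illustrative choices, not details taken from the paper.

```python
import jax.numpy as jnp

def newton_cg_step(grad, hvp, max_cg_iters=50, tol=1e-4):
    """Approximately solve H s = -grad by conjugate gradients, accessing
    the Hessian H only through the Hessian-vector product callable hvp."""
    s = jnp.zeros_like(grad)
    r = grad                          # residual of H s + g; equals g at s = 0
    p = -r                            # first CG search direction
    rs_old = jnp.vdot(r, r)
    g_norm = jnp.sqrt(rs_old)
    for _ in range(max_cg_iters):
        Hp = hvp(p)                   # the only way the Hessian is used
        pHp = jnp.vdot(p, Hp)
        if pHp <= 0.0:
            # Non-positive curvature safeguard: fall back to steepest
            # descent if no progress has been made yet.
            return -grad if bool(jnp.all(s == 0)) else s
        alpha = rs_old / pHp
        s = s + alpha * p
        r = r + alpha * Hp
        rs_new = jnp.vdot(r, r)
        if jnp.sqrt(rs_new) < tol * g_norm:
            break                     # truncated (inexact) Newton stopping rule
        p = -r + (rs_new / rs_old) * p
        rs_old = rs_new
    return s
```

Stopping the CG loop early yields an inexact Newton step; this adaptive control of the step quality is what makes Newton-CG economical in practice.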
Here are the contributions of this paper. First, to the best of our knowledge, this is the first application of Newton-CG methods to CRF learning. Second, we have devised a procedure that computes the Hessian-vector products of the CRF objective function in polynomial time, and we show that the additional computational cost of the Hessian-vector products is small relative to the cost of the gradient computations. Finally, we show that the work for the repeated Hessian-vector products in the CG iterations can be reduced by reusing the marginal probabilities, which are byproducts of the costly gradient computation. Since the total effort of Newton-CG methods is dominated by the cumulative cost of the CG iterations, the proposed method can significantly accelerate Newton-CG methods for CRF learning.

In experiments on natural language tasks, the proposed method outperforms a conventional quasi-Newton method. The experimental results also show that the proposed method is competitive with online learning algorithms, whose generalization and optimization performance is task-dependent. Online learning algorithms have been attracting attention because of their ability to handle huge amounts of data with acceptable accuracy (Bottou and Bousquet 2008). However, they have considerable disadvantages compared to batch learning, such as the lack of stopping criteria.
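The paper's speedup comes from a specialized dynamic program that reuses the forward-backward marginals already computed for the gradient across all CG iterations; that derivation is not reproduced here. As a reference point, the sketch below shows the generic alternative: a toy linear-chain CRF negative log-likelihood and its Hessian-vector product obtained by forward-over-reverse automatic differentiation. The names (`crf_nll`, `crf_hvp`) and the parameterization are illustrative assumptions, and this version recomputes the forward pass on every CG call rather than reusing cached marginals.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def crf_nll(theta, x, y, n_labels):
    """Negative log-likelihood of one labeled sequence (x, y) under a toy
    linear-chain CRF; theta packs emission weights W and transitions T."""
    n_feats = x.shape[1]
    W = theta[: n_feats * n_labels].reshape(n_feats, n_labels)
    T = theta[n_feats * n_labels:].reshape(n_labels, n_labels)
    emit = x @ W                                   # (seq_len, n_labels) scores
    # Unnormalized score of the gold label sequence y.
    gold = emit[jnp.arange(x.shape[0]), y].sum() + T[y[:-1], y[1:]].sum()
    # Log-partition function via the forward algorithm.
    alpha = emit[0]
    for t in range(1, x.shape[0]):
        alpha = emit[t] + logsumexp(alpha[:, None] + T, axis=0)
    return logsumexp(alpha) - gold

def crf_hvp(theta, r, x, y, n_labels):
    """Hessian-vector product H r of crf_nll at theta, computed by
    forward-over-reverse autodiff without forming the Hessian."""
    grad_fn = lambda th: jax.grad(crf_nll)(th, x, y, n_labels)
    return jax.jvp(grad_fn, (theta,), (r,))[1]
```

Passing `lambda p: crf_hvp(theta, p, x, y, n_labels)` as the `hvp` argument of `newton_cg_step` above gives a complete, if naive, Newton-CG step; the point of the proposed procedure is that each such product can instead be computed from the marginals that the gradient evaluation already produced.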
Similar Resources
A Fast Accurate Two-stage Training Algorithm for L1-regularized CRFs with Heuristic Line Search Strategy
The sparse learning framework, which has recently become popular in the field of natural language processing due to its efficiency and generalizability, can be applied to Conditional Random Fields (CRFs) with the L1 regularization method. The stochastic gradient descent (SGD) method has been used in training L1-regularized CRFs because it often requires much less training time than the batch tra...
Proximal Quasi-Newton for Computationally Intensive L1-regularized M-estimators
We consider the class of optimization problems arising from computationally intensive ℓ1-regularized M-estimators, where the function or gradient values are very expensive to compute. A particular instance of interest is the ℓ1-regularized MLE for learning Conditional Random Fields (CRFs), which are a popular class of statistical models for varied structured prediction problems such as sequenc...
Conditional Random Fields for Airborne Lidar Point Cloud Classification in Urban Area
Over the past decades, urban growth has been recognized as a worldwide phenomenon involving a widening process and an expanding pattern. While cities are changing rapidly, their quantitative analysis, as well as decision making in urban planning, can benefit from two-dimensional (2D) and three-dimensional (3D) digital models. The recent developments in imaging and non-imaging sensor technologies, s...
A Stochastic Quasi-Newton Method for Online Convex Optimization
We develop stochastic variants of the well-known BFGS quasi-Newton optimization method, in both full and memory-limited (L-BFGS) forms, for the online optimization of convex functions. The resulting algorithm performs comparably to a well-tuned natural gradient descent but is scalable to very high-dimensional problems. On standard benchmarks in natural language processing, it asymptotically outperfor...
Discriminative Learning of Probabilistic Sequence Models for Sequence Labeling Problems
The problem of labeling (or segmenting) sequences is very important in many applications, such as part-of-speech tagging in natural language processing, multimodal object detection in computer vision, and DNA/protein structure prediction in bioinformatics. The Conditional Random Fields (CRFs) of [1] are known to be the best sequence models for the problem. A CRF is a conditional model, P(s|y), i...
Publication date: 2011